12 research outputs found

    FTFDNet: Learning to Detect Talking Face Video Manipulation with Tri-Modality Interaction

    Full text link
    DeepFake-based digital facial forgery threatens public media security, and detection becomes even harder when lip manipulation is used in talking face generation: because only the lip shape is changed to match the given speech, identity-related facial features offer few cues for discriminating such fake talking face videos. Combined with the common neglect of the audio stream as prior knowledge, failure to detect fake talking face videos becomes almost inevitable. We observe that the optical flow of a fake talking face video is disordered, especially in the lip region, whereas the optical flow of a real video changes regularly, so motion features derived from optical flow are useful for capturing manipulation cues. In this study, a fake talking face detection network (FTFDNet) is proposed that incorporates visual, audio, and motion features using an efficient cross-modal fusion (CMF) module. Furthermore, a novel audio-visual attention mechanism (AVAM) is proposed to discover more informative features; as a module, it can be seamlessly integrated into any audio-visual CNN architecture. With the additional AVAM, the proposed FTFDNet achieves better detection performance than other state-of-the-art DeepFake video detection methods, not only on the established fake talking face detection dataset (FTFDD) but also on the DeepFake video detection datasets DFDC and DF-TIMIT. (Comment: arXiv admin note: substantial text overlap with arXiv:2203.0517)
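    The optical-flow cue described above can be illustrated with a minimal numpy sketch. This is not the paper's model: `flow_disorder_score` and the synthetic flow fields are illustrative assumptions, showing only the intuition that a region whose flow varies erratically over time (disordered, as in fake lips) scores higher than one with temporally regular motion.

```python
import numpy as np

def flow_disorder_score(flows, region):
    """Mean temporal variance of optical-flow vectors inside a region.

    flows:  array of shape (T, H, W, 2) -- per-frame flow fields
    region: (y0, y1, x0, x1)            -- e.g. a lip bounding box
    Higher scores mean more disordered (temporally irregular) motion.
    """
    y0, y1, x0, x1 = region
    patch = flows[:, y0:y1, x0:x1, :]       # (T, h, w, 2)
    return float(patch.var(axis=0).mean())  # variance over time, averaged

# Synthetic example: identical flow in every frame (regular) versus
# flow that is uncorrelated from frame to frame (disordered).
T, H, W = 8, 16, 16
regular = np.ones((T, H, W, 2))
rng = np.random.default_rng(0)
disordered = rng.normal(size=(T, H, W, 2))

lip = (8, 16, 4, 12)  # hypothetical lip bounding box
```

A real detector would of course learn this discrimination end to end; the score here only motivates why flow is a useful input modality.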

    Visual tracking based on semantic and similarity learning

    No full text
    We present a method that combines the similarity and semantic features of a target to improve tracking performance in video sequences. Trackers based on Siamese networks have achieved success in recent competitions and benchmarks by learning similarity from binary labels. Unfortunately, such weak labels limit the discriminative ability of the learned features, making it difficult to distinguish the target from distractors of the same class. The authors observe that inter-class semantic features help increase the separation between the target and the background, including distractors. They therefore propose a network architecture with both similarity and semantic branches that yields more discriminative features for accurately locating the target in new frames. The large-scale ImageNet VID dataset is used to train the network. Even in the presence of background clutter, visual distortion, and distractors, the proposed method keeps following the target. They evaluate the method on the open benchmarks OTB and UAV123; the results show that the combined approach significantly improves tracking performance relative to trackers that use similarity or semantic features alone
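    The two-branch idea can be sketched with a toy fusion of response maps (a simplification under assumed conventions; `fuse_response_maps` and the example maps are hypothetical, not the paper's architecture): the similarity branch alone peaks on a same-class distractor, while adding the semantic map moves the peak back to the true target.

```python
import numpy as np

def fuse_response_maps(similarity, semantic, alpha=0.5):
    """Weighted fusion of a Siamese similarity map and a semantic map.

    Both maps are min-max normalised so neither branch dominates; the
    peak of the fused map gives the predicted target location.
    """
    def norm(m):
        m = m - m.min()
        rng = m.max()
        return m / rng if rng > 0 else m
    fused = alpha * norm(similarity) + (1 - alpha) * norm(semantic)
    return np.unravel_index(np.argmax(fused), fused.shape)

# Toy maps: similarity slightly prefers a distractor at (1, 1);
# the semantic branch strongly prefers the true target at (3, 3).
sim = np.zeros((5, 5)); sim[1, 1] = 1.0; sim[3, 3] = 0.9
sem = np.zeros((5, 5)); sem[1, 1] = 0.2; sem[3, 3] = 1.0
```

Here `fuse_response_maps(sim, sem)` returns `(3, 3)` even though `sim` alone peaks at the distractor, which is the disambiguation the abstract describes.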

    Robust visual tracking based on watershed regions

    No full text
    Robust visual tracking is very challenging, especially when the target undergoes large appearance variation. In this study, the authors propose an efficient and effective tracker based on watershed regions. As middle-level visual cues, watershed regions contain more semantic information than low-level features and reflect more structural information than high-level models. First, the authors manually select the target template in the initial frame and predict the target candidate in the next frame using motion prediction. Then, they apply a marker-based watershed algorithm to obtain the watershed regions of the target template and the candidate template, and describe each region with multiple features. Next, they match the watershed regions via nearest-neighbour search in feature space and construct an affine relation from the target template to the candidate template. Finally, they solve the affine relation to compute the final tracking result and update the template for subsequent tracking. The tracker is tested on challenging sequences with appearance variations ranging from illumination change, partial occlusion, and pose change to background clutter, and compared with state-of-the-art methods. Experimental results indicate that the proposed tracker is robust to large appearance variation and exceeds the state-of-the-art trackers in most situations
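    The "match regions, then solve an affine relation" pipeline can be sketched as follows. This is a minimal sketch under stated assumptions: `match_regions` and `fit_affine` are illustrative names, the region descriptors are stand-ins for the paper's multiple features, and the affine fit is an ordinary least-squares solve on matched region centroids.

```python
import numpy as np

def match_regions(feat_template, feat_candidate):
    """Nearest-neighbour matching of region descriptors (Euclidean).

    Returns, for each template region, the index of its closest
    candidate region in feature space.
    """
    d = np.linalg.norm(feat_template[:, None] - feat_candidate[None], axis=2)
    return d.argmin(axis=1)

def fit_affine(src, dst):
    """Least-squares 2-D affine transform mapping src points to dst.

    src, dst: (N, 2) arrays of matched region centroids (N >= 3).
    Returns a 2x3 matrix A such that dst ~= [x, y, 1] @ A.T
    """
    n = len(src)
    X = np.hstack([src, np.ones((n, 1))])        # homogeneous coords (N, 3)
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (3, 2)
    return A.T                                   # (2, 3)

# Demo: recover a known 90-degree rotation plus translation by (2, 3).
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)
A_true = np.array([[0.0, -1.0, 2.0], [1.0, 0.0, 3.0]])
dst = np.hstack([src, np.ones((4, 1))]) @ A_true.T
```

Applying the fitted affine to the template then yields the candidate-frame target location, mirroring the final step of the abstract.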

    Push for quantization: Deep fisher hashing

    No full text
    Current massive datasets demand light-weight access for analysis. Discrete hashing methods are thus beneficial because they map high-dimensional data to compact binary codes that are efficient to store and process, while preserving semantic similarity. To optimize powerful deep learning methods for image hashing, gradient-based methods are required. Binary codes, however, are discrete and thus have no continuous derivatives. Relaxing the problem by solving it in a continuous space and then quantizing the solution is not guaranteed to yield separable binary codes. The quantization needs to be included in the optimization. In this paper we push for quantization: We optimize maximum class separability in the binary space. We introduce a margin on distances between dissimilar image pairs as measured in the binary space. In addition to pair-wise distances, we draw inspiration from Fisher's Linear Discriminant Analysis (Fisher LDA) to maximize the binary distances between classes and at the same time minimize the binary distance of images within the same class. Experiments on CIFAR-10, NUS-WIDE and ImageNet100 demonstrate compact codes comparing favorably to the current state of the art.
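    The two binary-space objectives in the abstract, a margin hinge between classes and a within-class pull, can be written down directly on ±1 codes. The sketch below is a toy numpy version (not the paper's differentiable training loss; `fisher_hash_loss` and the margin value are illustrative), using the identity that Hamming distance between ±1 codes equals (B − inner product)/2.

```python
import numpy as np

def fisher_hash_loss(codes, labels, margin=4.0):
    """Toy combination of the two binary-space objectives.

    codes:  (N, B) array with entries in {-1, +1}
    labels: (N,) class ids
    within:  mean Hamming distance over same-class pairs (minimised)
    between: hinge pushing different-class pairs past `margin`
    """
    B = codes.shape[1]
    ham = (B - codes @ codes.T) / 2.0          # pairwise Hamming distances
    same = labels[:, None] == labels[None, :]
    off = ~np.eye(len(labels), dtype=bool)
    within = ham[same & off].mean()
    between = np.maximum(0.0, margin - ham[~same]).mean()
    return within + between

labels = np.array([0, 0, 1, 1])
# Perfectly separated codes: zero within-class distance, full between-class distance.
codes_sep = np.array([[1] * 8, [1] * 8, [-1] * 8, [-1] * 8])
# Collapsed codes: every image hashes identically, so the margin hinge fires.
codes_bad = np.ones((4, 8), dtype=int)
```

With separated codes the loss is exactly zero; with collapsed codes the between-class hinge contributes the full margin, which is the failure mode that naive relaxation-then-quantization can produce.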

    Multi‐scale mean shift tracking

    No full text
    In this study, a three-dimensional mean shift tracking algorithm, which combines a multi-scale model with a background-weighted spatial histogram, is proposed to address the problem of scale estimation within the mean shift tracking framework. The target template is modelled with a multi-scale model and described with a three-dimensional spatial histogram. The tracking algorithm is implemented by three-dimensional mean shift iteration, which translates the problem of scale estimation in the two-dimensional image plane into localisation in three-dimensional image space. To enhance robustness, the background-weighted histogram is employed to suppress background information in the target candidate model. First, the multi-scale model and the three-dimensional spatial histogram are introduced to represent the target template. Then, the three-dimensional mean shift iteration formulation is derived from the similarity measure between the target model and the target candidate model. Finally, a multi-scale mean shift tracking algorithm combining the multi-scale model and the background-weighted spatial histogram is proposed. The proposed algorithm is evaluated on challenging sequences containing scale-changed targets and other complex appearance variations, in comparison with three representative mean shift based tracking algorithms. Both the qualitative results and the quantitative analysis indicate that the proposed algorithm outperforms the referenced algorithms in both tracking precision and scale estimation
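    The core move, running the same mean shift update in (x, y, scale) space so that scale is estimated jointly with position, can be sketched generically. This is a minimal weighted mean shift with a flat kernel (an assumption; the paper uses histogram-based weights and a derived kernel), where `points` live in three dimensions so the third coordinate plays the role of scale.

```python
import numpy as np

def mean_shift(points, weights, start, bandwidth=1.0, iters=20):
    """Weighted mean-shift iteration with a flat (uniform) kernel.

    points:  (N, D) samples; in the paper's setting D = 3 for (x, y, scale),
             so the same update localises position and scale jointly.
    weights: (N,) per-sample weights (e.g. histogram back-projection).
    """
    x = np.asarray(start, float)
    for _ in range(iters):
        d2 = ((points - x) ** 2).sum(axis=1) / bandwidth ** 2
        k = np.where(d2 < 1.0, 1.0, 0.0) * weights  # samples inside the window
        if k.sum() == 0:
            break
        x = (points * k[:, None]).sum(axis=0) / k.sum()
    return x

# Demo: samples clustered around position (2, 2) at scale 1.5.
rng = np.random.default_rng(1)
centre = np.array([2.0, 2.0, 1.5])
points = centre + 0.1 * rng.normal(size=(60, 3))
est = mean_shift(points, np.ones(60), start=[0.0, 0.0, 1.0], bandwidth=4.0)
```

The iteration converges to the cluster's weighted mean, recovering both the 2-D location and the scale coordinate in one update rule.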

    Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

    Full text link
    Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scene understanding. Owing to their differing characteristics, independently designed architectures have been widely used for each individual task. This makes the representation learned by such a model task-specific and inevitably limits the generalisation ability of features based on multi-modal modelling. More recent studies have shown that establishing a cross-modal relationship between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by bridging the multi-modal associations in audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of audio-visual modelling. (Comment: 13 pages, 8 figures)
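    A common building block for the cross-modal relationship the abstract refers to is attention from one stream over the other. The sketch below (an assumption, not the paper's architecture; `cross_modal_attention` is a hypothetical name) shows scaled dot-product attention where visual frames query the audio stream, producing audio context aligned to each visual frame.

```python
import numpy as np

def cross_modal_attention(audio, visual):
    """Scaled dot-product attention: visual queries attend over audio keys.

    audio:  (Ta, D) audio frame embeddings
    visual: (Tv, D) visual frame embeddings
    Returns (Tv, D): audio context vectors aligned to each visual frame.
    """
    d = audio.shape[1]
    scores = visual @ audio.T / np.sqrt(d)       # (Tv, Ta) alignment scores
    scores -= scores.max(axis=1, keepdims=True)  # softmax numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1
    return attn @ audio

# Demo: if every audio frame is identical, each visual frame's context
# must equal that frame, since attention weights form a convex combination.
audio = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), (3, 1))
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 4))
ctx = cross_modal_attention(audio, visual)
```

In a joint detection-plus-enhancement model, such aligned context could feed both task heads, which is one way a shared audio-visual representation avoids being task-specific.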

    Recent progress in self-repairing coatings for corrosion protection on magnesium alloys and perspective of porous solids as novel carrier and barrier

    No full text
    Featuring low density and high specific strength, magnesium (Mg) alloys have attracted wide interest in portable devices and the automotive industry. However, their chemically and electrochemically active nature makes them susceptible to corrosion in humid air, seawater, soil, and chemical media. Various protection strategies have shown merit; among them, engineering self-repairing coatings is considered especially effective, because such coatings enable timely repair of damaged areas and thereby provide long-term protection for Mg alloys. In this review, self-repairing coatings on Mg alloys are summarized from two aspects: shape restoring coatings and function restoring coatings. Shape restoring coatings rely on swelling, shrinking, or the reassociation of reversible chemical bonds to return the coating to its original state and morphology when it is broken; function restoring coatings depend on the release of inhibitors that generate new passive layers on the damaged areas. As coating research advances and application requirements grow more demanding, developing coatings that integrate multiple functions (such as stimulus response, self-repairing, and corrosion warning) is an inevitable trend. As novel carriers and barriers, porous solids, especially covalent organic frameworks (COFs), are regarded as a promising future direction for self-repairing coatings on Mg alloys, owing to their unique, diverse structures and tunable functions